Overview

The EDL pipeline runs automatically every weekday at 4:00 PM IST via GitHub Actions. The workflow fetches fresh market data, processes it through all pipeline stages, and commits the compressed output back to the repository.

Workflow Configuration

The workflow is defined in .github/workflows/daily_refresh.yml:
name: Daily Data Refresh

on:
  schedule:
    - cron: '30 10 * * 1-5' # Runs at 4:00 PM IST (10:30 UTC) Mon-Fri
  workflow_dispatch: # Allows manual trigger from GitHub UI

permissions:
  contents: write

jobs:
  refresh-data:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Cache OHLCV Data  
        uses: actions/cache@v4
        with:
          path: |
            DO NOT DELETE EDL PIPELINE/ohlcv_data
            DO NOT DELETE EDL PIPELINE/indices_ohlcv_data
          key: ohlcv-v1-${{ runner.os }}-${{ hashFiles('DO NOT DELETE EDL PIPELINE/master_isin_map.json') }}
          restore-keys: |
            ohlcv-v1-${{ runner.os }}-

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests pandas beautifulsoup4

      - name: Run Pipeline
        run: |
          cd "DO NOT DELETE EDL PIPELINE"
          python run_full_pipeline.py

      - name: Commit and Push Results
        run: |
          git config --global user.name "GitHub Actions"
          git config --global user.email "actions@github.com"
          git add "DO NOT DELETE EDL PIPELINE/all_stocks_fundamental_analysis.json.gz"
          git add "DO NOT DELETE EDL PIPELINE/sector_analytics.json.gz"
          git add "DO NOT DELETE EDL PIPELINE/market_breadth.json.gz"
          git add "DO NOT DELETE EDL PIPELINE/all_indices_list.json"
          git commit -m "Automated Daily Data Refresh [skip ci]" || echo "No changes to commit"
          git pull --rebase --autostash origin main
          git push

Schedule Configuration

Cron Schedule

schedule:
  - cron: '30 10 * * 1-5'
Breakdown:
  • 30 10: minute 30 of hour 10, i.e. 10:30 AM UTC
  • * *: every day of the month, every month
  • 1-5: Monday through Friday
Time conversion:
  • UTC: 10:30 AM
  • IST: 4:00 PM (UTC + 5:30)
Why 4:00 PM IST?
  • NSE closes at 3:30 PM IST
  • 30-minute buffer ensures all settlement data is available
  • Corporate actions and announcements are typically posted by 4 PM
The workflow only runs on weekdays (Monday-Friday) since Indian stock markets are closed on weekends.
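The UTC-to-IST conversion can be verified with Python's standard zoneinfo module (the date below is an arbitrary Monday, chosen only to match the weekday schedule):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# 10:30 UTC, as fired by the cron expression '30 10 * * 1-5'
utc_run = datetime(2025, 1, 6, 10, 30, tzinfo=timezone.utc)  # a Monday
ist_run = utc_run.astimezone(ZoneInfo("Asia/Kolkata"))       # IST = UTC+5:30
print(ist_run.strftime("%H:%M"))  # 16:00, i.e. 4:00 PM IST
```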

OHLCV Caching Strategy

The workflow uses GitHub Actions cache to persist OHLCV data between runs, reducing execution time from ~35 minutes to ~5-7 minutes.

Cache Configuration

- name: Cache OHLCV Data  
  uses: actions/cache@v4
  with:
    path: |
      DO NOT DELETE EDL PIPELINE/ohlcv_data
      DO NOT DELETE EDL PIPELINE/indices_ohlcv_data
    key: ohlcv-v1-${{ runner.os }}-${{ hashFiles('DO NOT DELETE EDL PIPELINE/master_isin_map.json') }}
    restore-keys: |
      ohlcv-v1-${{ runner.os }}-

How It Works

1. Cache Key Generation

The cache key includes:
  • Version: ohlcv-v1 (bump to invalidate all caches)
  • OS: ${{ runner.os }} (ubuntu-latest)
  • ISIN Map Hash: ${{ hashFiles('master_isin_map.json') }}
Example key: ohlcv-v1-Linux-8a3f2c9d...
2. Cache Restoration

Before running the pipeline:
  • Exact match: restores OHLCV data for the current stock universe
  • Fallback: uses the partial match ohlcv-v1-Linux- if the ISIN map changed
3. Incremental Update

fetch_all_ohlcv.py detects existing files and only fetches:
  • New trading days for existing stocks
  • Full history for newly listed stocks
4. Cache Saving

After a successful pipeline run:
  • Updated OHLCV data is saved to the cache
  • Available for the next workflow run

Cache Invalidation

The cache is automatically invalidated when:
Event                | Reason                               | Impact
New stock listed     | master_isin_map.json changes         | Creates new cache key
Stock delisted       | master_isin_map.json changes         | Creates new cache key
Cache version bumped | Manual change: ohlcv-v1 → ohlcv-v2   | Forces fresh download
7 days of inactivity | GitHub cache eviction policy         | Old cache deleted
GitHub Actions caches are limited to 10 GB per repository. The OHLCV cache typically uses ~500 MB (2,775 stocks × ~180 KB per CSV).

Manual Trigger

You can manually run the workflow outside the scheduled time using workflow_dispatch.
1. Navigate to GitHub Actions

Go to your repository → Actions tab

2. Select the workflow

Click Daily Data Refresh from the workflows list

3. Run manually

Click Run workflow → Select branch (usually main) → Run workflow

4. Monitor execution

Watch real-time logs to track pipeline progress
Use cases for manual triggers:
  • Testing workflow changes
  • Refreshing data after market hours outside schedule
  • Recovering from a failed automated run
  • Forcing a fresh data pull after API changes

Files Committed

After successful pipeline execution, the workflow commits:
git add "DO NOT DELETE EDL PIPELINE/all_stocks_fundamental_analysis.json.gz"
git add "DO NOT DELETE EDL PIPELINE/sector_analytics.json.gz"
git add "DO NOT DELETE EDL PIPELINE/market_breadth.json.gz"
git add "DO NOT DELETE EDL PIPELINE/all_indices_list.json"
File                                    | Size   | Records | Description
all_stocks_fundamental_analysis.json.gz | ~2 MB  | 2,775   | Complete stock analysis (86 fields/stock)
sector_analytics.json.gz                | ~8 KB  | 12      | Sector-wise aggregated metrics
market_breadth.json.gz                  | ~10 KB | 1       | Market-wide breadth indicators
all_indices_list.json                   | ~85 KB | 194     | All market indices (uncompressed)
Total committed size: ~2.1 MB per commit

Commit Behavior

git commit -m "Automated Daily Data Refresh [skip ci]" || echo "No changes to commit"
  • [skip ci]: Prevents triggering another workflow run
  • || echo "No changes": Prevents failure if data is identical to previous run
  • Rebase strategy: Uses git pull --rebase --autostash to avoid merge commits

Workflow Execution Time

Scenario          | Duration     | Cache Status
First run         | ~35 minutes  | No cache (full OHLCV download)
Daily refresh     | ~5-7 minutes | Cache hit (incremental update)
New stock added   | ~6-8 minutes | Partial cache hit
Cache invalidated | ~35 minutes  | No cache (full rebuild)
Phase breakdown (with cache):
PHASE 1 (Core Data):           ~45 seconds
PHASE 2 (Enrichment):          ~90 seconds
PHASE 2.5 (OHLCV Incremental): ~2 minutes
PHASE 3 (Analysis):            ~30 seconds
PHASE 4 (Injection):           ~45 seconds
PHASE 5 (Compression):         ~5 seconds
─────────────────────────────────────────
Total:                         ~5-7 minutes
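As a sanity check, the per-phase durations above (all approximate) sum to a total inside the quoted ~5-7 minute range:

```python
# Approximate per-phase durations from the breakdown above, in seconds.
phases = {
    "PHASE 1 (Core Data)": 45,
    "PHASE 2 (Enrichment)": 90,
    "PHASE 2.5 (OHLCV Incremental)": 120,
    "PHASE 3 (Analysis)": 30,
    "PHASE 4 (Injection)": 45,
    "PHASE 5 (Compression)": 5,
}
total_minutes = sum(phases.values()) / 60
print(round(total_minutes, 1))  # 5.6 minutes
```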

Monitoring and Debugging

Viewing Logs

  1. Go to Actions tab
  2. Click the workflow run
  3. Expand each step to see detailed logs

Common Issues

Runner runs out of disk space
Cause: OHLCV data plus intermediate files exceed the runner's disk space.
Solution:
# In run_full_pipeline.py
CLEANUP_INTERMEDIATE = True  # Ensure this is enabled

Workflow is slower than usual
Cause: The cache key changed or the cache expired.
Solution: The first run after a key change will be slower; subsequent runs will use the new cache.

Commit step reports no changes
Cause: Data is identical to the previous run (rare; usually weekends).
Solution: This is expected behavior. The || echo prevents the step from failing.

Pipeline fails mid-run
Cause: Network issues or API rate limiting.
Solution: Manually re-run the workflow. Check API endpoints for outages.

GitHub Actions Limits

Be aware of GitHub Actions usage limits:
  • Free tier: 2,000 minutes/month
  • Daily runs: ~7 minutes × 22 workdays = ~154 minutes/month
  • Buffer: ~1,846 minutes for manual runs and retries
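The monthly budget above is simple arithmetic and can be checked directly:

```python
free_minutes = 2000          # GitHub free-tier Actions minutes per month
run_minutes = 7              # typical cached daily run
workdays = 22                # weekday runs per month
scheduled = run_minutes * workdays   # minutes consumed by the schedule
buffer = free_minutes - scheduled    # left for manual runs and retries
print(scheduled, buffer)     # 154 1846
```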
Cost optimization:
  • OHLCV caching reduces runtime by 85% (35 min → 5 min)
  • CLEANUP_INTERMEDIATE = True reduces storage usage
  • [skip ci] in commit message prevents recursive triggers

Security Considerations

Repository Permissions

permissions:
  contents: write
The workflow requires write access to commit updated data files.

No Secrets Required

All data sources used by the pipeline are publicly accessible APIs:
  • Dhan ScanX endpoints
  • NSE Archives
  • No authentication tokens needed
If you fork this repository, ensure Actions are enabled in repository settings and the workflow file is present in .github/workflows/.